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LOCAL LINEAR CONVERGENCE OF ISTA AND FISTA ON THE 

LASSO PROBLEM * 

SHAOZHE TAOt, DANIEL BOLEYt, AND SHUZHONG ZHANG§ 

Abstract. We establish local linear convergence bounds for the ISTA and FISTA iterations on 
the model LASSO problem. We show that FISTA can be viewed as an accelerated ISTA process. 
Using a spectral analysis, we show that, when close enough to the solution, both iterations converge 
linearly, but FISTA slows down compared to ISTA, making it advantageous to switch to ISTA toward 
the end of the iteration processs. We illustrate the results with some synthetic numerical examples. 


1. Introduction. The /i-norm regularized least squares model has received 
much attention recently due to its wide applications in the real problems includ¬ 
ing compressed sensing [5], statistics [7], sparse coding [10], geophysics [11] and so on. 
The problem in question is: 

(1.1) min y 2 ||Ax-b|[^ + A|[x||i 

where A € is a given matrix, b is a given vector and A is a positive scalar. 

The idea of h regularization is decades old, but the least squares problem with 
li penalty was presented and popularized independently under names Least Absolute 
Selection and Shrinkage Operator (LASSO) [17] and Basis Pursuit Denoising [5]. For 
example, in compressed sensing, we are interested in recovering a solution x to an 
undetermined system of linear equations Ax = b in the case where n ^ m. The linear 
algebra tells us that this linear system either does not exist or is not unique when 
the number of unknowns is greater than the number of equations. The conventional 
way to solve the system is to find the minimum l 2 -norm solution, also known as linear 
least squares. However, if x is sparse, as very common in many applications, then x 
can be exactly recovered by computing the above /i-norm regularized least-squares 
model. Since LASSO becomes the dominant expression describing this model, we will 
use term LASSO to denote the above model for the remainder of the paper. 

Although the LASSO problem can be cast as a second order cone programming 
and solved by standard general algorithms like an interior point method [2], the com¬ 
putational complexity of such traditional methods is too high to handle large-scale 
data encountered in many real applications. Recently, a number of algorithms that 
take advantage of the special structure of the LASSO problem has been proposed. 
Among them, two remarkable ones are iterative shrinkage thresholding algorithm 
(ISTA) and its accelerated version fast iterative shrinkage thresholding algorithm 
(FISTA). 

ISTA is also known as the proximal gradient method and its computation only in¬ 
volves matrix and vector multiplication, which has great advantage over standard con¬ 
vex algorithms by avoiding a matrix factorization [16]. Recently, Beck and Teboulle 
[I] proposed an accelerated ISTA, named as FISTA, in which a specific relaxation 
parameter is chosen. A similar algorithm to FISTA was also previously developed 
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independently by Nesterov in [13]. Both two algorithms are designed for solving prob¬ 
lems containing convex differentiable objectives combined with an li regularization 
term as the following problem: 

min{/(x) + g{x.) : x G K"} 

where / is a smooth convex function and g is a continuous function but possibly 
nonsmooth. Clearly, the LASSO problem is a special case of above formulation with 
/(x) = y 2 ||Ax — bp, g{x) = ||x||i. Its gradient V/ = A'^Ax — A'^b is Lipschitz 
continuous with constant L{f) = 2p{A^A) = [[A’^^Ap, i.e., ||V/(xi) — V/(x 2 )|| < 
L(/)||xi—X 2 II, Vxi, X 2 G R". It has been shown [1] that FISTA provides a convergence 
rate of 0(I/fc^) compared to the rate of 0(1/ k) by ISTA, while maintaining practically 
the same per-iteration cost, where k is the iteration number. However, in contrast to 
the results of a global convergence rate, as far as we know, there is no result on the 
local convergence behavior of standard ISTA and FISTA. 

In this work, we establish local bounds on the convergence behavior of ISTA 
and FISTA on the LASSO problem. Comparing the two methods, we show how 
FISTA can be considered as an accelerated ISTA, but as one approaches the solution, 
ISTA can actually be faster. Extending the same techniques as in [3], we show that 
linear convergence can be reached eventually, but not necessarily from the beginning. 
Specifically, we give a way to represent the ISTA and FISTA iteration as a matrix 
recurrence and apply a spectral analysis on the corresponding eigenvalue problem. We 
analyze the local behavior as it passes through several phases or “regimes”, treating 
each regime separately. Based on the spectral analysis, each possible regime one 
can encounter during the course of the iteration is characterized. Under normal 
circumstances, the theory predicts that either ISTA or FISTA should pass through 
several stages or “regimes” of four different types, some of which consist of taking 
constant steps, but finally reaching a regime of linear convergence when close enough 
to the optimal solution. Besides, with our analysis, more properties on FISTA and 
ISTA on the LASSO problem can be derived. 

Throughout this paper, all vector and matrix norms are the I 2 norms (the largest 
singular value for a matrix) unless otherwise specified. For real symmetric matrices, 
the matrix 2-norm is the same as the spectral radius (largest absolute value of any 
eigenvalue), hence we use those interchangeably for symmetric matrices. 

The paper is organized as follows. Section 2 gives some basic preliminaries of the 
paper. We introduced how to formulate ISTA and FISTA into a matrix recurrence 
form in Sections 3.1 & 3.2 and then derived spectral properties of the matrix operators 
in Sections 3.3 & 3.4. Section 4 gives details about four types of regimes that ISTA 
and FISTA will encounter in the iterations process based on our spectral analysis. 
Our first main result is given in Section 5, which shows the local linear convergence 
of ISTA and FISTA on the LASSO problem. In Section 6 we compare the behavior 
in each regime, showing that FISTA can be faster that ISTA through most of the 
regimes, but asymptotically can be slower as one approaches the optimal solution. 
Section 7 includes two numerical examples run by the standard ISTA and FISTA, to 
illustrate many of the predicted behaviors. 

2. Preliminaries. 

2.1. Optimality condition of the LASSO problem. The first order KKT 
optimality conditions for the LASSO problem (1.1) are 

A^(b- Ax) = Aiz 


( 2 . 1 ) 


2 


where each component of u satisfies 


( 2 . 2 ) 


{ Vi = sign(a;i) if ^ Ol 
( — 1 <Vi<+\ if Xi = Oj 

for 

function is defined as 

1 

f+1 

if a; > 0 

sign(x) = 


if a; = 0 

1 

-1 

if a; < 0 


for * = 1, 2, • • • . 


2.2. Uniqueness. There are various sufficient and necessary conditions for the 
uniqueness of the LASSO problem or its variants. For example, [15, 4, 8] show different 
sufficient conditions and [18] studies the necessary conditions for the LASSO problem. 
In fact, the problem (1.1) needs to have a unique solution in many situations. For 
example, in compressive sensing signal recovery, having non-uniqueness solutions will 
result in unreliable recovery given the data. We refer readers to [18, 19] and references 
therein for the uniqueness of the LASSO problem. 

2.3. ISTA and FISTA iteration. In this part, we review the basic iteration 
of ISTA and FISTA for solving the LASSO problem. To make clear the difference 
between ISTA and FISTA, we let x and x denote the iterates of ISTA and FISTA 
respectively in the remainder of this paper. The basic step of ISTA for the LASSO 
problem can be reduced to [6, 1] 

xI'^+H = argmin{5(x) + (^''=1 - yLV/(xW))f} 

X 

= argmin{A||x||i + - P'"' - ^^(ATAxW - } 

X 

= Shr,^J(J - y^ATA)xW + y^ATb). 

One advantage of ISTA is that the above step can be solved in closed form, 
leading to the following updates repeated until convergence, where xl*^! denote the 
vectors from previous pass, and L is the given constant equal to ||A"'"A|| 2 . 

Algorithm 1: One pass of ISTA 
start with xl*^l. 

Set = ShrA/^((/ - yiA'rA)xW -t i/lA'^L). 

Result is x[^+^l for next pass. 


As for FISTA, the difference from ISTA is that the shrinkage operator is not 
employed on the previous point but a point , which uses a very specific linear 

combination of the previous two points {x[^“^l,x[^“^l}. The algorithm of FISTA 
for LASSO problem is presented as below, where the initial = x^^^ G K" and 

= fW = 1. _ 

Algorithm 2; One pass of FISTA 
start with and 

1. Set yW = - 5[fc-2]). 

2. Set = ShrA/^((/ - yiATA)yW -fi y^ATb). 

3. Set = i+Vi+4tW\ 

Result is xl^l, x[^“^l for next pass. 
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3. Auxiliary Variables with Local Monotonic Behavior. 

3.1. ISTA as a Matrix Recurrence. Instead of carrying the iteration using 
variables x^^l, we use two auxiliary variables to carry the iteration. One variable, 
namely, w^^'l exhibits smooth behavior, with linear convergence locally around a fixed 
point, and the other variable is simply a ternary vector based on the three cases 
of the shrinkage operator. We let, for all k, the common iterate be 

(3.1) wW = {I- yLATA)x['=l + y^A'^b 
and vector be defined elementwise as 

f 1 if {yf' > 

(3.2) = sign(ShrA/ wf')) = < 0 if - 

[-1 ifwf'<-yi. 

By the updating rule in Alg. 1 and above two equations, one can obtain the 
xl^l-update in terms of and d^^l 

(3.3) x['=+il = ShrA^^(w['=l) = (5^)2-[fe] _ a/^JW 

where the matrix = diag(d[^l). Using (3.1), (3.2) and (3.3), the update formula 
for w now can be expressed explicitly as follows: 

-I- 

= [(/ - yiA^A)(i3['=l)2].^[fe] _ (-J _ -k y^A^b 


where we denote 

rW = [(J-y^ATA)(5W)2] 

hW = -(j-yiATA)yidW-kyiA^b 

throughout this paper. Therefore, the ISTA in Alg. 1 with variable x can be modified 
to the following procedure using the new variables w and D. 

Algorithm 3: One pass of modified ISTA 
start with 

p .^[fc+i] = jiW^lk] (with i?W,h['=l defined by (3.4). 

2. = DiAG(siGN(ShrA^^(w['']))). 

Result is D[^+^1 for next pass. 


Alg. 3 is mathematically equivalent to Alg. 1 and is designed only for the purpose of 
analysis, not intended for computation. We note that step 1 of Alg. 3 can be written 
as a homogeneous matrix recurrence in (3.5), which we will use to characterize ISTA’s 
convergence. 


(3.5) 


^(/-yiATA)(i5W)2 
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where we denote R] 


[k] 

aug 


as 


/ Rlk] 

I 0 1 


the augmented matrix of in this paper. 


The following lemma shows the fixed point of Alg. 3 is a KKT point of the LASSO 
problem and vice versa. 


Lemma 3.1. Suppose 


is an eigenvector corresponding to eigenvalue 1 of 


Rf, 


r (omitting [k]) in (3.5) and D = = DiAG(d) with entries di = 1 

if Wi > ^/i^, di = —1 if Wi < — "Yz, and di = 0 if — Yl Wi ^l- Then the variable 
defined by Sc = Shrx^ (w) satisfies the 1st order KKT condition. Conversely, if Si and 

V = y;)jA'^(b — Ax) satisfy the KKT condition, then w = x + is 

an eigenvector of Raug corresponding to eigenvalue 1, where Raug is defined as in 

(3.5) and = D = DiAG(d) with entries di = 1 if Wi > ^ji^, di = —1 if 

Wi < --Yz and di = 0 if -^z < m < "Yz- 
Proof. From (3.5), we have 

(3.6) {A^AD^)w + L{I - = ^/^A^Ad + A'^b - Ad 

and with the definition 


(3.7) Xi = ShrA^^(iCi) = < 


w^ - ^z if > ^z 

0 if - ^z <Wi< ^z 

Wi + ^z if Wi < -^z 


di = l 
di=d 
di = -1 


we define a set £ s.t. 

£ = 


I* e {I,-- - ,n} : |d)| = l| = {i e {I,-- - ,n} : Xi ^ 0} 
and denote £ as the complement set of £. Then immediately by (3.7) 

(3.8) dg =0, % = 0, % e [-^z, Y'l] and (d^:). = 


1 if (xz)^ > 0 
-1 if (xz)^ < 0 


We split matrix A and do the permutation such that A = [Ag, A^]. So (3.6) becomes 


(3.9) 


[AjAe 0 
[a^As 0 


Wg 


+ L 


OfA 0f.£ 


_ (AjAe AJA^\ /dA /(ATb)A ( 

- /- [ a - ia , a ^ a ^) (0) + ((/i-bv) - ( 


Wf 

W; 

dg 

0 


To satisfy the KKT condition, it is sufficient to show the following two equations 
(3.10) AJ (b - Agxf - Ag%) = 


A if (xz). > 0 
-A if (xz). < 0 


(3.11) A|(b - AfXz - AjSij)\ < A. 

To show (3.10), considering the case in £ 
Equation (3.9) = 


AjAfWz = ^/j^AjAsds + (A'^b)^ - Adz 
Aj'Az(wz - Ade) = (A'^b)z - Adz 
(A'^b)z - AjAzxz = Adz. 
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To show (3.11), considering the case in E 

A^As-ws + L-w-^ = 

A^As{-we - ^L^e) + Lwj = (A'^b)^ 

A^AsSie + Lwj = (^'’"b)^ 

Lwj = (A'^b)j - AlAgSis 

|A|(b - Afxe)! < A. 

Conversely, from KKT condition (2.1), we have y^A"^b — ^j^A^Ax = ^jiV and 
then X = (/ — A)x. + ^AAAh — ^Au. By KKT condition (2.2), we obtain 

X = ShrA^^((/ - ^/lA^A)x + Vi^^b). 

This shows that KKT point of the LASSO can be written in the form of ISTA iterates. 
Then by one pass of our transformation, we get the matrix recurrence form as (3.5). 
□ 


Equation (3.9) 


3.2. FISTA as a Matrix Recurrence. Similar to ISTA, we use auxiliary vari¬ 
ables to replace variable x^^l for carrying the FISTA iterations. We set 


(3.12) wl'^l = (I - yiA^A)y['=l -h ^AA^h. 

Hence, 

(3.13) = ShrA^^(wW) = 

where for all k, the matrix = diag(d[^l), and the vector d^^l is defined elementwise 
as 


(3.14) 




sign(ShrA^^wf')) = < 


1 

0 

-1 


if wf^ > ^A 
if - ^A ^ ^ 

if wf^ <-^Il- 


Using (3.12), (3.13) and the updating formula in Alg. 2, we arrive at 
(3.15) 

(/ - y^A^A) 

■"-1 I i)((i5W)2wW - y^dW) 


W 


[fc+l] _ 


= (/-y^A^A) 


(tt^ 

,[fcl_ 


-(/ - y^A^A) ^((z?['=-ii)2w['=-ii + y^di'^-H) + y^ATb 


[(/-yL4i^A)(5['=-ii)2^ 

+(/ - y^A^A) [-(1 + g^)yidw + + y^A^b 

(1 _|_ .rW)RWwW - -I- h 

pWwW -I- _|_ hW 


wife] 

w[fe~^] 


6 





















where we denote 


T-[fcl 

p[k] 

(3.16) 

R[k] 

h[fe] 


(l + rW)i?['=l 


(/- yi^TA)(i5W)2 

(/ - ^/lA^A) [-(1 + T['^'l)y/^dW + 


+ 


in the rest of this paper. Note that in (3.4) refers to the mapping at the fc-th 
iteration of ISTA while in (3.16) refers to the mapping that would occur if one 
took one step of ISTA starting at the fc-th iterate of FISTA. For the purposes of 
analysis, the modified FISTA iteration then can be equivalently expressed as in Alg. 

4. 


Algorithm 4: One pass of modified FISTA 
start with and 

.^[fc+i] =p[fc]wW+Q[fc-ilw['=-il+hW (with pW,g['=-il,hW defined by (3.16)). 
2 ^ ^ i+Vi+4ti'-i- go ^[k] ^ g^_ 

3. £>[''+^1 = DiAG(siGN(ShrA^^(w[''l))). 

Result is wl^l, and for next pass. 


Step 1 of above procedure can also be formulated as a homogeneous matrix recurrence 
analogous to (3.5) for ISTA, but with a larger (approximately double) dimension: 


(3.17) 



—ivlfc] 

^ aug 



0 1 


/ pWi 


h[fc]\ / 

I 

0 

0 

V 0 

0 

1 / \ 



We denote = 


'y" 0';-') .„dhw=(‘'y) such that n£i,=(";' y' 

in the remainder of this paper. 

The following lemma shows the fixed point of Alg. 4 is a KKT point of the LASSO 
problem and vice versa, analogous to Lemma 3.1 for ISTA. 

Lemma 3.2. Suppose W 2 is an eigenvector corresponding to eigenvalue 1 

Vi/ ^ ^ ^ 

o/Naug (omitting [k]) in (3.17), then Wi = W 2 := w. Suppose further that D = 
£)[fe+i] _ p)[k] _ i3jAG(d) with entries di = 1 if Wi > di = —1 if Wi < —^/l o.nd 


di = 0 if —^/l Si Wi "Vl- Then the variables defined by Ti = Shrxj (w) satisfies 
the 1st order KKT conditions (2.1) (2.2). Conversely, if Si and u = — Ax) 


' w> 


"aug 


satisfy the KKT conditions, then ( w , withw = x + ^/j^u, is an eigenvector o/Ng 
corresponding to eigenvalue 1, where Ngug is defined as in (3.17) and D = = 
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]jVA = DiAG(d) with entries di = 1 if Wi > di = —1 if Wi < —^/l o,nd di = 0 if 
-7l <w^< ^/l- 

Proof. The first statement is directly from the second block row of (3.17). With 
D = _ ]jlk]^ fixed point of (3.17) becomes (omitting the superscripts k) 

/w\ fP Q h\ /w\ 

(3.18) ®l(^l 

\0 0 ij 

yielding 

(3.19) {A^AD^)w + L{I - = ^/LA^Ad + A^h - Ad 

which is exactly (3.6). So the rest part is the same as ISTA. 

Conversely, for the given KKT point x, If we define y = x + ^-^pq^(x —x) where 

is updated as = i±Vi+4ti^^ could show, just as in the proof of Theorem 
3.1, that X = ShrA^^((/ - yi^APA)Si + YiA'^b) = ShrA^^((/ - yiA'^A)y + y^A'^b). 

This is exactly the FISTA iterates. By one pass of our transformation, we obtain the 
matrix recurrence form as (3.17). □ 

For the preparation of the further discussion, we make three remarks. 

(a) , —> 1 from below as k — > oo. 

(b) . .RW = i?W if 51'=] = I)W and hW = if This 

observation relates the Raug to Naug- It is the foundation upon which we establish 
the properties to compare ISTA and FISTA in Section 6. 

(c) . One main difference between operators of ISTA and FISTA (i.e. Raug and 

[All ffcl 

Naug) is that Rkug is fixed when the flag matrix is fixed while Nkkg changes at each 

step k even if the flag matrix is fixed. In other words, for all k, Railg = Rk^ug^^ if 

]j[k] ^ f)[fe+i]. But Nkklg nLuV' even if The reason is that Nkklg 

depends on the changing stepsize . Nevertheless, one can still use the same similar 
ffcl ffcl . . 

argument for Nkug as for Rkug by additional lemmas, as we will show in Section 4. 

3.3. Properties of Rkklg- It is seen that R^ilg and Nkklg play key roles in the 

ffcl 

convergence. Hence we now focus on the spectral properties of Rkug in this part and 
ffcl 

Nkkg in next part. Before that, we first recall some theory relating the spectral radius 
to the matrix norm. 

Theorem 3.3. Let p{M) denote the spectral radius of an arbitrary square real 
matrix M, and let IIMil 2 denote the matrix 2-norm. Then we have the followinq: 

(a) . p{M) < IIMII 2 . 

(b) . //IIMII 2 = p{M) then for eigenvalue A such that |A| = p{M), the algebraic 
and geometric multiplicities of X are the same (all Jordan blocks for A is 1 x 1 j. Such 
a matrix is said to be a member of Class M [12, 14-]. 

(c) . If a X such that |A| = p{M) has a Jordan block of dimension larger than 1 
(the geometric multiplicity is strictly less than the algebraic multiplicity), then for any 
e > 0 there exists a matrix norm || • ||g (based on a nonsingular matrix G) such that 
p{M) < ||M||G<p(M) + e. 

We refer reader [3, 14, 12, 9] for the proof of the above theorem. 

Lemma 3.4. Regarding ISTA, there are three properties of 
(a). ||AW|| = ||(/-y^ATA)(5W)y|<l. 





(b) . All eigenvalues must lie in the interval [0,1]. 

(c) . If there exists one or more eigenvalues equal to 1, then eigenvalue 1 must 
have a complete set of eigenvectors. 

Proof. We here omit the pass number [^1 for simplicity. 

(a) . \\R\\ = \\iI-yi^A^A)D^<\\iI-^/^A^A)\\\\D^<l. 

(b) . The eigenvalue of R is the same as the eigenvalue of R' = D{I — ^j^A^^A)!). 
Noticing L = ||A'^A|| 2 , we obtain ||i ?'||2 < ||I1^||||(/— '^/i^Af^A)\\ < 1. In addition, R' 
is symmetric and a positive semidefinite matrix. Hence all eigenvalues of R' must lie 
in the interval [0,1]. Hence should those of R. 

(c) . Because p{R) = ||i ?||2 = 1, this statement follows directly from Lemma 3.3. 

□ 


3.4. Properties of Naug- Now we turn to the spectral analysis of the FISTA 

fA;! 

operator. Lemma 3.5 demonstrates the eigenstructure of N^ug and its relation to 
that of Raug. 

Lemma 3.5. We let 7 and ft denote the eigenvalue of and respeetively. 

Any eigenvector ( corresponding to eigenvalue 7 must satisfy Wi = 7 W 2 . 

\W2 J 

Suppose and hence then (omitting index 1^1 ) we have the 

following results: ^ 

(a) . W 2 must also be an eigenvector of R with eigenvalue (3, where (3 and 7 has 
the relation 7 ^ — 7(1 + t)/ 3 + r/3 = 0 . 

(b) . ForO <T <1, the eigenvalues of N defined in (3.17) lie in the closed disk in 

the complex plane with center 1/2 and radius denoted as T’(y 2 ) 72 )- particular, 
if N has any eigenvalue with absolute value p{N) = 1, then that eigenvalue must be 
exactly 1. ^ 

(c) . N has an eigenvalue equal to 1 if and only if R has an eigenvalue equal to 1. 

(d) . Assuming T < 1, then if has an eigenvalue equal to 1, this eigenvalue 
must have a complete set of eigenvectors. 

Proof. By the definition of (just after (3.17)) 


M[k] . 1^2^ ^ /wA 

VW 2 ) \ I 0 ) \ V/2) \ Wi ) ^2 J 


and thus "Wi = 7 W 2 is observed from the second row. 

(a). 

N ■ (= (^ ^\ (+ t)jRw2 - tR\V2\ ^ ( 7W2\ 
yW2 j \I 0) \ 2 J V 7W2 J V W2 y 


and therefore, 

(3.20) Pw 2 = 


7 


-W2 = /3w2 


7 ^ - (1 + r)7/9 + t/3 = 0. 


[(1 + t )7 - r] 

(b). We first study the spectrum of matrix N — ^( 2 !, then the spectrum of N 
should be a shift by y 2 /. Let a = 7 — y 2 be the eigenvalue of — '^( 2 ! associated with 

Wi^ 


eigenvector 


W 2 


, then according to (3.20), a and j3 have the relation 


+ (1 - /3 - r/3)a + 1 / 2 r/3 - 1 / 2/3 + 1/4 = 0. 
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Note that r G ( 0 , 1 ] and fj G [ 0 , 1 ] by definition of R. We consider two situations for 
the above quadratic equation. First, suppose ai and a2 are two conjugate complex 
roots. Then ai = a2, ai + a2 = t /3 + /3 — 1 and aia2 = ^ 12^13 — 72/3 + 74 such that 

lap = \aidi\ = \aia2\ = \%tP - %I3 + 74 ] < ^ 

which gives |a| < 72- The second situation is that two roots are real numbers and 
must look like 

1 + T V?\/(l + 'r)^/3-4T 3 + t V7\/(l + T)2/3-4r 

«I = —<3+-^- 'h «2 = —3-^- % 

To get the largest possible value of a, we only look at ai because ai > 02 for any 
fixed /3. Since ai is an increasing function of /3 and /3 G [0,1], the largest real value of 
a should 72 when /3 = 1 . On the other hand, to get the smallest negative real value 

of a, we only need to look at a2. One can write a2 = T ^(/3 — ^) — 72 fo 

see that a2 > —72- So we conclude that if a is real, then —72 < a < 72- 

Under both situations, one can conclude a must satisfy |a| < 72 j lying in n disk 
centered at 0 with radius 721 i-e- T>( 0 , 72)- So the eigenvalues of N must lie in the 
disk 23 ( 72 ) 72 ) dy the shift. Consequently, the only possible eigenvalue on the unit 
circle is 1 . 

(c) . 71,72 are the two roots of the quadratic polynomial, i.e. 7^ — (1+t)7/3+t/3 = 
(7 _ 7i)(7 _ 72) =0. For given / 3 , they must satisfy 

7172 ='r /3 and 71 + 72 = (1 + t)/3 =/3 + 7172. 

If N has an eigenvalue 71 = 1, then 72 = (1 + r)/3 — 1 = /3 + 7172 — 71 = /3 + 72 — 1, 
hence /3 = 1 must be true and R has an eigenvalue equal to 1. Conversely, if R has an 
eigenvalue /3 = 1, the quadratic polynomial (3.20) will reduce to 7 ^ — (1 + t )7 + t = 0, 
which gives 71 = 1 and 72 = r. Then N has an eigenvalue equal to 1. 

(d) . Notice in ( 3 . 20 ) that each eigenvalue /3 of i? maps to two eigenvalues of iV, 71 

— f 'T1W2 

and 72, and associated eigenvector W2 of R maps to two eigenvectors of A^, ~ 

\ W2 

and As shown in (c), N has an eigenvalue equal to 1 if and only if R has 

\ W2 y 

an eigenvalue equal to 1 . Since R and R have the similar eigenstructure, eigenvalue 
1 of i? must have a complete set of eigenvectors. So the only possible situation that 
eigenvalue 1 of does not have a complete set of eigenvectors is that both 71 and 
72 equal to 1 . However, this is impossible because we have shown in (c) that /3 = 1 
gives 71 = 1 and 72 = t which is close to 1 but not equal. As a result, if N has an 
eigenvalue 1, and then its algebraic and geometric multiplicities coincide. □ 

4. Regimes. Since the ISTA and FISTA updating steps have been converted 
into a variation of an eigenproblem in previous sections, we can study the convergence 
in terms of the spectral properties of the operator Raug in ( 3 - 5 ) and Naug in ( 3 . 17 ). 
Hence in this section, we show how the spectral properties of Raug, Naug are reflected 
in the possible convergence “regimes” that ISTA and FISTA can encounter. 

4.1. Spectral Properties. The eigenvalues of the augmented matrices Raug 
and Naug consist of those of R, N plus an extra eigenvalue equal to 1 , respectively. If 
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R (or N) already has an eigenvalue equal to 1, then the extra eigenvalue 1 may or may 
not add a corresponding eigenvector. The next lemma gives limits on the properties 
of the eigenalue 1 for any augmented matrix of the general form of Raug, Naug- 

Lemma 4.1. Let Maug = be any block upper triangular matrix with a 

1x1 lower right block. The matrix Maug has an eigenvalue ai = 1 and suppose its 
corresponding eigenvector has a non-zero last element. We scale that eigenvector to 

take the form ( ^ j = Maug f ^ ] ■ V the upper left block M either has no eigenvalue 


egual to 1 or the eigenvalue 1 of M has a complete set of eigenvectors, then ai = 1 

/w\ 

has no non-trivial Jordan block. Furthermore, if the given eigenvector ( ^ 


then M has no eigenvalue equal to 1. 

We refer readers [3] to the proof of Lemma 4.1. Now we summarize spectral 
properties of our specific operators Raug and Naug in terms of their possible Jordan 
canonical forms as given in the following lemmas. Essentially these lemmas say that 
all their eigenvalues must have absolute value strictly less than 1 , except for the 
eigenvalue equal to 1. And the eigenvalue I’s geometric multiplicity either equal to 
or one less than its algebraic multiplicity. 

Lemma 4.2. Assuming then Ri^ig in (3.5) is fixed and has a 

spectral decomposition Ri^ug = where is a block diagonal matrix 


(4.1) 


J^l = DlAG 


Pk] <j[fc] 


where any of these blocks might be missing. Here is an identity matrix and 
is a matrix with spectral radius strictly less than 1 . 

Proof. For Railg, the upper left block of (3.5) (i.e. satisfies Lemma 3.4 and 

\k] '^\k] 

hence contributes blocks of the form , J'jjb No eigenvalue with absolute value 1 
can have a nondiagonal Jordan block, so the blocks corresponding to those eigenvalues 
must be diagonal. Embedding that upper left block into the entire matrix yields 
a matrix Rkug, with the exact same set of eigenvalues with the same algebraic and 
geometric multiplicities, except for eigenvalue 1 . 

If the upper left block of Ralg has no eigenvalue equal to 1, then Ralg has a 
simple eigenvalue 1. In general for eigenvalue I, the algebraic multiplicity goes up 

by one and the geometric multiplicity can either stay the same or increase by I. In 
Ik] 

other words, Rkug either satisfies the conditions of Lemma 4.1, or the algebraic and 
geometric multiplicities of eigenvalue I for Rkug differ by I, meaning we have a single 

2x2 Jordan block [q jV 


Lemma 4.3. Assuming j since nL'^Is in (3.17) is different at each 

\k] \k] 

step, for each step k, there exists a P)y such that Nkug has a spectral decomposition 
Naug = Pjv^ (P{v^)~^; where is the block diagonal matrix: 

(4.2) 41=Diag|(^J 

where any of these blocks might be missing. Here is an identity matrix, is a 
matrix with spectral radius strictly less than 1. 
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[fel 

Proof. The proof is similar to Rkug- We only note here that the upper left block of 
(3.17) (i.e. N) satisfies Lemma 3.5 and hence contributes blocks of the form 
□ 


4.2. Four regimes. Lemma 4.2 & 4.3 give rise to the four possible “regimes” 
associated with the ISTA and FISTA iterations, depending on the flag matrix and 
the eigenvalues of We treat separately the case where the flag matrix 

remains the same at each iteration, in which there are three possible regimes, and 
treat all the transitional cases together in their own fourth regime. For simplicity of 
the notation, we let D denote the flag matrix instead of D and D unless specified. 

When the flag matrix does not change, i.e. the ISTA operator 

Ra^ug remains invariant over those passes while the FISTA operator N^^ug is slightly 
different at each iteration due to the changing parameter In both cases, the 

structure of the spectrum for that specific operator controls the convergence behavior 
of the process during these passes. We summarize as follows three specific possible 

[fcl [All 

regimes distinguished by the eigenstructure of the operators Rkug, Nkkg- One of the 
these three regimes must occur when the flag matrix is unchanged from one step to 
the next: £>['=+11 = £['=1. 

[A] . The spectral radius of (or is strictly less than I. The block 

is absent from (4.1) (or (4.2)), and the block (or /^^) is I x I. If close enough to 
the optimal solution (if it exists), the result is linear convergence to that solution. 

For R[Sg, as long as the flags remain the same, the recurrence (3.5) hence will 
converge linearly to a unique fixed point at a rate determined by the next largest 
eigenvalue in absolute value (largest eigenvalue of the block J^^), according to the 
theory for the power method. If the flags are consistent with the eigenvector 
satisfying (3.2), then the eigenvector must satisfy the KKT condition because of 
Lemma 3.1. 

[fcl 

For Nkiig, though it changes slightly at each step, yet we will show later that the 
left and right eigenvectors for eigenvalue I do not depend on r (Lemma 5.4), and the 
remaining eigenvalues remain bounded away from I. The result is that we observe 
linear convergence to a unique fixed point with the slightly changing convergence 
rate. If the flags are consistent with the eigenvector satisfying (3.14), then that 
eigenvector must satisfy the KKT condition because of Lemma 3.2. 

[B] . (or A^I^l) has an eigenvalue equal to I which results in a 2 x 2 Jordan block 

^ for Ri'dg (or Naig)- Therefore, the iteration process tends to a constant step. 




For R; 


[fe] 


the theory of power method implies that the vector iterates will con¬ 


verge to the invariant subspace corresponding to the largest eigenvalue 1. The presence 

of means that there is a Jordan chain: two non-zero vectors q, r such that 

(Raug — I)^ — r, (Raug ~ d)r = 0. Any vector which includes a component of the 
form aq-f/Jr will be transformed into Raug(Q:q-l-/3r) = aq-f (a -|- /3)r, i.e. each pass 
would add a constant vector ar, plus fading lower order terms from the other lesser 
eigenvalues. As long as the flags do not change, this will result in constant steps: 


the difference between consecutive iterates. 


1 


I 


would converge to 


a constant vector, asymptotically as the effects of the smaller eigenvalues fade. That 
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constant vector is an eigenvector for eigenvalue 1. The ISTA iteration will not con¬ 
verge until a flag change in forces a change in the flags If we satisfy the 

conditions for global convergence of ISTA, then such a flag change is guaranteed to 
occur. 


The same situation applies to N, 

w['=l 

1 


aug- The difference between two iterates ^ ^ 


would asyptotically converge to a constant vector. The FISTA iteration will 


not converge until a flag change in forces a change in the flags Such a flag 
change is guaranteed to occur due to the global convergence of FISTA. In Section 6.1, 
we will show that FISTA can jump out of such regime very fast, which is the main 
reason why it is faster than ISTA. See Section 7 for more discussions on its numerical 
behavior. 

[C]. (or has an eigenvalue equal to 1, but the block f J | j is absent. 


For Raug (or N, 


aug;, the convergence rate of this regime will still depend on p(J^^) 
and p(J(^*). If we assume the solution is unique, the eigenvalue 1 of Raug (or Naug) 
must be simple by Lemma 4.1. So the iteration will eventually jump out this type of 
regime. 

When the flag matrix does change, it means the set of active constraints at the 
current passes in the process has changed, and the current pass is a transition to a 
different operator with a different eigenstruture. 

[D]. The operator (or A^I^+^1) will be different from (or due to 

different flag matrix. 

5. Unique Solution: Linear Convergence. 

5.1. Partial Spectral Decomposition. As we remarked at the end of Section 
fk] 

3.2, Nkug is different for different step k. In the Lemma 4.3, we show that, for different 




k, NliJilg is spectrally decomposed by different matrix This section shows that if 

fk] 

unique solution is assumed, then for different k, N(,ug can be spectrally decomposed 
by the same matrix, denoted as Pv- 

Lemma 5.1. Assume that FISTA in Alg. 2 has a unique solution x*. If iteration 
j > K for some K, the stepsize t^^'^ becomes frozen at a constant value i.e. 

~ * = c G [0, 1], then the iteration will converge to the same solution x*. 

Proof. Since x* is the unique solution, it must be the fixed point of the algorithm. 
If the algorithm converges, one must have xt^'l — —>■ 0. In Alg. 2, it is easy to 

see that no matter what c = - — pj ] ~^ is, the converging point will always be the same 
point, the optimal solution x*. □ 


[k] 




as 


Since Alg. 4 is equivalent to Alg. 2, we have the similar statement for Naug, 
shown in the next lemma. 

Lemma 5.2. Assume that the FISTA in Alg. 4 has a unique solution w*ag = 


1 


If iteration j > K for some K, the stepsize becomes frozen at a constant 


number, i.e. 


^“ = ^ = cG[0,l],N 


bl 


the iteration will converge to the same solution w*^g. 

u-i 

simple dominant eigenpair o/Naug. 


will then become a constant matrix and 
In other words, (l,w* ) is a 
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Proof. Once rl-’J is fixed for j > K, then Nalg will be fixed. Due to Lemma 5.1, Alg. 
4 will still converge to the same solution w*^ □ 


The above two lemmas directly indicates the next proposition: 

Proposition 5 . 3 . Though Nailg in Alg. 4 are different for different k, they share 
the same simple dominant eigenpair (A = l,w*^g) as long as the LASSO problem has 
a unique solution. 

\k] 

Now we show that if every N^ug has the same simple dominant eigenpair, then 
for different k, Nailg can be decomposed by the same matrix, denoted as Pn- 

Lemma 5 . 4 . Assuming and FISTA in Alg. 4 has a unique solution, 

then N^fug has a spectral deeomposition Nj^fug = PN^^^Pf/' ■ Note that Pn is the 
same for all and 


(5.1) 


t[^] _ 

'’n — 


1 0 


0 J 


where JJy is a matrix with spectral radius strictly less than 1. 

Proof. If the LASSO problem has a unique solution, by Lemma 5.1 & 5.2, for any 
k, must share the same single dominant eigenpair (A = l,w*^g). This is a 

simple eigenvalue with a fixed left eigenvector = (0, • • • ,0,1). Therefore, we can 
construct Pat = [w*^g,W] where W G ]g( 2 n+i)x 2 n fo^nis the basis for space {z}^ 

and its inverse has the form P)^^ = [z, Z] by scaling, where Z is a basis for 
(w^ug)"*" determined by w*^g, z and W. So Pat is a matrix independent of any 

that for Nailg, 


N 


l\T[fe] p-1 


= [^ 


,W] NWg [z 


Zl^ = 


1 0 


0 J 


□ 


Lemma 5 . 5 . We denote matrix as Naug in which r = 1 , then one can 


write Naig — N^^g + (1 — T[''l)ANaug, where 


N' = 

aug 


(5.2) 


and 


AN, 


aug 


^2AW ^^[-2(1^ 

/ 0 0 

0 0 1 

^Rlk] ^[fc-1] Ay^[3[fe] _3[fc-l]]N 

0 0 0 
0 0 0 


If nW = 14 ['=-11 and Nk'^lg has only a simple eigenvalue equal to 1, then N(,^g 

r^j [ fe] 1 _T 

must also have a simple eigenvalue equal to 1 and Naug = PaJjvPat =PMJA'PAr + 


(1 — r['=l)PAf JatvPat^ with the same Pn in Lemma 5.4. Note that 


^0 

Jat' 


and 3 an = 


, where 3 n' and 3 an are matrices with spectral radius strictly 


^ ^0 
yO Jam 

less than 1. Consequently, = Jn'+ (1 - r['=l)J AN 
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Proof. By Lemma 3.5(b), eigenvalue of must lie in the disk ^’( 72 , 72 )- Be¬ 
sides, now that has only a simple eigenvalue equal to 1, by Lemma 3.5(c), 


should have no eigenvalue equal to 1. This indicates matrix 


2i?W 

/ 0 


has 


no eigenvalue equal to 1. And hence must also have a simple eigenvalue equal 


to 1. 


Due to Lemma 5.2, (1, w*^ ) is in the right nullspace of — I. Additionally. 


(0, • • • ,0,1)™ is in the left nullspace of NJ^^g — /. Therefore, one can construct the 

1 ^0 

0 3n' 


same Pn in Lemma 5.4 such that N(^^g = 


PjY . According to the 


relation N, 
0 of AN, 


aug 


NLug + (1 - ^'"')AN, 


aug- 


aug, this Pat must work as well for eigenvalue 

□ 


5.2. Local Linear Convergence. In the case that (1.1) has a unique solution 
with strict complementarity, we can give a guarantee that eventually the flag matrix 
will not change. By strict complementarity, we mean that for every index i, w* ^ 
±^^(01 w* ^ 3z^/l)- Once the flag matrix stays fixed, the ISTA (or FISTA) iteration 
behaves just like the power method (or similar to power method) for the matrix 
eigenvalue problem. In this case, the spectral theory developed here gives a guarantee 
of linear convergence with the rate equal to the second largest eigenvalue of the matrix 
operator. 

In this section we will use the loo norm of a vector: ||v||oo = max^ |ui|, and the 
associated induced matrix norm ||A||oo = max^ \o-ij\- We will also use the matrix 
G-norm where G is a non-singular matrix, defined to be HxIIg = ||Gx||oo for any 
vector X, and ||A ||(3 = ||GAG“^||oo for any matrix A. The following lemma relates the 
vector oo-norm to the vector 2 -norm. 

Lemma 5.6. For any n-vectors a, b, (||a||oo + ||b||oo)^ < 2(||a||2 -I- HblH) 

We refer readers to [3] for the proof. 

Under the assumption of strict complementarity and unique solution, we can 
prove the specific result that ISTA (or FISTA) iteration must eventually reach and 
remain in “linear convergence” regime [A]. First we note that by Lemma 3.1 (or 3.2), 
this solution must correspond to a unique eigenvector of eigenvalue 1 for the matrix 
R-aug (or Naug) when the flag matrix does not change. Additionally, 

by Lemma 4.1, the matrix R (or N) has no eigenvalue equal to 1, and by Lemma 3.4 
(or 3.5) all the eigenvalues must be strictly less than 1 in the absolute value. Hence 
the following lemma applies to this situation. 

Lemma 5.7. Consider the general augmented matrix and its eigenvector 

(5.3) Maug = and ^ such that MaugW^^g = 

where M is any nx n matrix such that the spectral radius p of M satisfies p{M) < 1. 
The vector w* is the unique eigenvector corresponding to eigenvalue 1, scaled so that 
its last element is = 1. Then the following holds. 

(a) . For any e > 0, there is a matrix norm || • ||g such that p{M) < ||M||g < 
p{M) + e. In particular, one can choose G so that ||M||g < 1. Also, there is a 
positive constant Gi (depending on M & G) such that for any vector or matrix X, 

II^IIg < GillAlU and ||A|U < Ci||A||g. 

(b) . The iterates of the power iteration w^^ug^^ = MaugWa^g satisfy ||wa^ig — 
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w*ug||G < Il-Mllgllwaug — w*^g||G and hence converge linearly to w*^g at a rate 
bounded by p{M) + e where e is the same arbitrary constant used in (a). This is a 
special case of the theory behind the power method for computing matrix eigenvalues. 

(c). Given any positive constant C 2 , */Waug is any vector such that ||w£lg — 
w*ag||oo < <^ 2/(^1 then ||(Maug)''wL°J,g-w*^g||oo < C 2 for all k. In particular, if w* 
is bounded away from two points and C 2 = mini{|w* — | — e, |u>* + -y^ | — e} > 0 , 

then any element of vector (Maug)^'Waug should he bounded away from two points 
for all k = 0,1,2, ■■■. 

Proof. The proof of (a) and (b) are omitted. They are similar to those of the lemma 

6.2 in [3]. For (c), we make more comments. Define Gaug = with the G 

from part (a), and define the corresponding Gaug-norm on the augmented quantities. 
Define the following balls around the eigenvector w*^g: 

Bi ={Waug • ||Waug ~ '''^auglloo ^ G 2 } 

(5.4) B 2 ={Waug : llWaug - W^^gHc^ag < G 2 /G 1 } 

B 3 ={Waug : ||Waug ~ 1100 < C 2 /C 1 }. 

From part (a), B 3 C B 2 '!= Bi. From part (b), if any power method iterate satisfies 
w[°l G B 2 , then all subsequent iterates stay in B 2 . Hence if the power method starts 
in S 3 , all subsequent iterates will lie in Si. □ 

We now invoke the global convergence property of ISTA and FISTA. 

Theorem 5.8. If problem 1.1 is solvable, i.e. X* := argminF(x), where S(x) = 
y 2 ||Ax — b||| + A||x||i. If let x[°i be any starting point in R" and x^^^l he the sequence 
generated by ISTA. Then for any k > 1, Vx* G X*, F(x[^l) — S(x*) decreases at 
the rate of Otflif). On the other hand, if let = xl'^l be any starting point in R", 
= 1 and {x^}, {^^’1} be the sequence generated by FISTA. Then for any 

k> 1, Vx* G X*, we have F(x[^l) — F(x*) decreases at the rate of 

This is a restatement of the convergence theorem in [1]. It says little on the 
local behavior of the algorithm. Under the assumption of the unique solution, it 
guarantees that eventually the iterates will converge to the optimal value. With the 
iterates convergence, we present our main results in the following theorems. 

Theorem 5.9. Suppose the LASSO problem 1.1 has a unique solution and this 
solution has strict complementarity: that is for every index i, w* Then 

eventually the ISTA iteration reaches a stage where it converges linearly to that unique 
solution. 

Proof. Since for any index i, w* ^ (where j is the pass number) could only 

be in one of three cases: > -y^, < —^/l or —^/l < ^ ^/l- We can let 

C 2 = min{|{t;* — — e, |u;* + -y^il — e} > 0 for a positive constant e sufficiently small 

to make G 2 > 0 . 

By theorem 5.8, there exists a pass k such that ||wl^l — w*||^ < (G 2 /G 1 ). Hence 
wt^'l lies in B 3 stated in Lemma 5.7(c), and = DiAG(siGN(Shry (w W))) 

is the 

associated flag matrix. By Lemma 5.7(c), there exists a pass K such that lies in 
Bi (i.e. ||wLug — w*^g||oo < G 2 ) for all k > K. This means that will remain in 
one of three cases: wf > -y^, w’f < —^/l or —^/l If: wf < -y^ and will never change 
to another case for all k > K. This, combined with the definition of flag matrix 
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= DlAG(siGN(ShrA/ ( wl^l))), implies that the flag matrix remain unchanged for 
/ L 

all k > K. Thus starting at the K-th pass, the ISTA iteration reduces to the power 
method on the matrix Rkug, converging linearly to the unique eigenvector at a rate 
given by Lemma 5.7(b). □ 


Theorem 5.10. Suppose the LASSO problem (1.1) has a unique solution and 
this solution has strict complementarity: that is for every index i, w* ^ ^^II- Then 
eventually the FISTA iteration reaches a stage where it converges linearly to that 
unique solution. 

Proof. As the proof for Theorem 5.9, one can construct = minjlu;* — ^/l\ — 
e', Ifc* + ^/i^\ — e'} for a positive constant e' sufficiently small to make > 0. By 
Theorem 5.8 and Lemma 5.7(c), there exists a pass Ki such that lies in (i.e. 
Ilwi'dg - w*„g||oo < C' 2 ) for all k > Ki. It also means the flag matrix will remain 
unchanged for all k > Ki. Thus, by Lemma 5.4 


(5.5) 




aug 


aug aug 

i].p—i. 


— ir ]\fj jy Ir Ir ]\fJ jy 


= P 


N 


= P 


N 


N 

0 

j[k+l-l] 

'n 


[k] 

aug *’ aug 

[A;+^ — 2] .p— 1 


N 

0 

i[k+l- 2 ] 

*N 


^[fe+Z-l]^[fc+i-2] ^[fc] 


■ ■ P AT J 

1 


jWp-lwW 


N ^aug 


0 J 


N 


1 [/c] 

aug 


N 


''aug- 


For each we can write = Jtv' + (1 — t[^1)Ja7 V, where Jat', Jam are defined 
in Lemma 5.5. Due to the fixed flag matrix after AT-th pass, Jm' and Jam remain 
fixed for all k > Ki. By Lemma 5.7, Ve > 0 with e < 1 — p{3jq'), 3 || • Hg such 
that ||Jm'||g < p{3n') + 72 ^ < 1 — ^ 2 ^- Since Jam is fixed, there must exist a pass 
K 2 {> Ki) such that (1 — t^^I) • ||Jam||g < 72 ^ for all k > K 2 . Therefore, starting at 
Ar 2 -th pass, one has ||J{^^||g < ||Jm'||g + (1 - t[''1)IIJam||g < /o(Jm') + e < 1. 


(5.6) 


^ N aug 


=p 


1 

0 

7^^ 


f[fe] 

’n 


(P 


— 1 ~ * 

N ^aug 


’aug 


+ ykug) 


f[fe] 

’n 


_ p 

./aug N ”aug 


-1 




' J aug 


(5.7) IlyiVg'llG < 0(||j'^+''||||j'^+'-''|| • • •) < O((p(Jm0 + £)''=+'') ^ 0. 

Therefore, ||wLug — w*^g|| converge linearly at a rate bounded by p(Jm') + e < 1. 

□ 


6. Acceleration. It is known that FISTA exhibits a global convergence rate 
0 ( 7 ^ 2 ), which accelerates ISTA’s convergence rate 0(7fc). Compared to this worst 
case convergence result, we analyze how FISTA and ISTA behave through all iterations 
on the perspective of spectral analysis we establish in this paper. First, we characterize 
one important property based on three possible regimes. 

Lemma 6.1. Suppose R and N have the same the flag matrix, ISTA and FISTA 
have the following relations: 
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(a) . If FISTA is in regime [A] or [C], then so is ISTA, and vice versa. 

(b) . If FISTA is in regime [B], then so is ISTA, and vice versa. 


Proof. We note that if FISTA and ISTA start at the same iterate, we have D = D, 
hence R defined in (3.15) is exactly operator R defined in (3.5). 

(a). If FISTA is in regime [A] or [C], then N either has no eigenvalue equal to 
1 or has a complete set of eigenvectors associated with eigenvalue 1. In other words, 
the augmented matrix Naug must have a complete set of eigenvectors for eigenvalue 


1. Let 


be the eigenvector for eigenvalue 1, then 


( 6 . 1 ) 


{N-I) 





—tR h\ 



X = y (by second row) 

Ax — X + h = {R — /)x + h = 


^ ? a 




0 


= 0 


Therefore, becomes the eigenvector for eigenvalue 1 of Raug- R must either 

have no eigenvalue equal to 1 (in regime [A]) or have a complete set of eigenvectors 
associated with eigenvalue 1 (in regime [C]). The opposite direction follows by similar 
argument. 

(b). Since one of the regimes [A], [B], [C] must occur, this statement can be 
considered as the contraposition of (a). □ 


This lemma suggests that both ISTA and FISTA are in the same regime as long 
as both operators have the same flag matrix. It motivates one to compare in each 
regime between FISTA and ISTA when starting from the same starting point (which 
results in the same flag matrix). By assuming the same starting point and a fixed 
flag matrix, we have and thus R = R, h = h. We 

will use these notations interchangeably and omit [^1 for anything but iterates in the 
following analysis. It turns out that FISTA is faster in regime [B], but not always 
faster in regimes [A] and [C] depending on the parameter . 


6.1. In Regime [B]. In regime [B], as mentioned in Section 5.2, there exist 
Jordan chains such that the difference of iterates will converge to a constant step. Let 

\ 

and be two consecutive iterates for ISTA 

V 1 / V 1 / 

and FISTA, respectively. In the following lemmas, we will show that the constant 
step for FISTA is larger than ISTA when starting at the same point, which yield a 
speedup. 



Lemma 6.2. The constant step vector for ISTA is where v = Rv is an 

eigenveetor of R. 

Proof. For ISTA, there must be a Jordan block J]:j for the augmented matrix Raug- 
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Then there exists a Jordan chain such that 


( 6 . 2 ) 


R h 

0 1 


w + V 
1 


and 


R h 

0 1 


In other words, each pass of ISTA will add a constant vector 

□ 


in regime [B]. 


( cv' 

cv I , where v 
0 

is the same v in lemma 6.2, c is a scalar to he determined. 

Proof. Assume the constant vector is ( V 2 | . Then basic iteration of FISTA is 

0 


wW 1 =N 



(1 + t)R —tR h 
/ 0 0 
0 0 1 



Due to the presence of Jordan block 


1 1 
0 ly’ 


there exists a Jordan chain 


(6.3) 


/ wl'^l \ 

, / 

^[k-1] 

= 

\ 1 

' \ 


Vl' 


0 


'Vl' 


0 


'Vl 


0 


In the second equation of (6.3), the second row implies Vi = V 2 . Then, the first 
row implies i?vi = Vi. Since both Vi and v are eigenvectors for eigenvalue 1 of R, 
we can write Vi = cv where c is a scalar to be determined. Hence the constant step 

should be I cv I . □ 

0 

Lemma 6.4. Suppose ISTA and FISTA start from the same point in the same 
regime [B], i.e. then c in Lemma 6.3 equals to where t is a scalar 

(A 

close to 1. The constant step vector for FISTA is > which is larger than 

" Vo/ 


V I , the ISTA constant step. 

Proof. By Lemma 6.3, the equation (6.3) expands to 


(6.4) 




/ wt'^l \ 

--- 

1 



V 1 J 


'(1 + t)R —tR h' 

I 0 

0 0 1 , 

'((1 + t)R - J)wW - + h\ 

wW - 

0 
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which is supposed to be equal to 


From the second row, 


cv 


or — cv. Hence, the first row should be cv = ((1 + t)R — — 

+ h= ((1 + T)i?--Ti?(wW -cv) + h= (i? - + h +crv. The 
last equality follows by Rv = v. 

If FISTA and ISTA start from the same point = wl^l, then cv = {R — 
+ h + CTV = {R — + h + crv = v + crv, leading to c(l — r) = 1. Hence 


Lemma 6.4 indicates that if FISTA and ISTA starts from the same starting point 
in one specific regime [B], then it will cost FISTA fewer iterations to leave this regime 
with larger constant step. Hence it is an acceleration compared to ISTA in regime 
[B]. 

6.2. In Regimes [A] and [C]. On the other hand, in regimes [A] and [C], the 
convergence rate of the two algorithms are related to the spectral radius of R and N. 
Particularly, the rate of FISTA depends on r and the iteration number since r is a 
determined sequence based on iteration numbers. Let /?, 7 denote an eigenvalue of 
R, N, respectively, and /?max, 7max denote the corresponding eigenvalues of largest 
absolute value. As stated in Section 4.2, we must have 1 > /?max, 7max > 0 in regimes 
[A] or [C]. In addition, by Lemma 3.5, /3 and 7 satisfy the relation 7^ — 7 (l+T)/ 3 +r /3 = 
0. Let 7i and 72 be two roots of 7. We conclude our result in the following proposition. 

Proposition 6.5. Suppose ISTA and FISTA start from the same point in a 
certain regime [A] or [C] and FISTA is faster than ISTA */ 1 > 

/?max > T > 0 hut slower if 0 < /3max < T < 1. If T > /3max then the eigenvalue 
7 max of N of largest absolute value is one of a pair of complex conjugate eigenvalues. 
Because /3max is a fixed value for one specific regime, with the r growing to 1, ISTA 
will be faster than FISTA toward the end. 

Proof. We prove this proposition in three steps. 

(a) . From 7 ^ — 7(1 + r)/3 + t/3 = 0, 7 has real roots if 1 > /3 > and has 

complex roots if > /? > 0 . 

(b) . If > /3 > 0, then 71 and 72 are two conjugate roots such that I 71 P = 

7172 = t/ 3 . Note that 0 < r < I, we separated into two cases. If > /3 > r, 

then I 7 I = ^/TP < p. If r > /3 > 0, then I 7 I = > /3. 

(c) . If 1 > /3 > ■> T < /3, and 71 and 72 are real and 71 = max{ 7 i, 72 } = 

( 1 ^ ^ ^ { 1 +^ ^ ^ inequality is due 

to /3 > T and the second one is due to r < 1. 

Since (b) and (c) are true for any pair of 7 and /3, combining them together, we 
get | 7 inax| < |, 8 max| if /3max > T and | 7 inax| > |, 8 max| if /3max < T. In such a regime [A] 
or [C], as stated in Section 5.2, both ISTA and FISTA iteration can be reduced to the 
power method or similar to power method on the operator Raug and Ngug and the 
rate is determined by the |/3max| and lymaxl- We complete the proof. □ 


Proposition 6.5 concludes that if the starting points are the same in regimes [A] 
or [C], then ISTA will first be slower but then be faster as the iteration progresses. 

7. Examples. Example 1. We illustrate the eigen-analysis of the behavior of 
ISTA and FISTA and then compare them by the propositions in Section 6 based on a 


20 















Fig. 1. ISTA (left) and FISTA (right) on Example 1: Curves A: — x*||. B.- ||x[*^l—xl*^ 


uniform randomly generated LASSO problem. Specifically, in problem (1.1), A and b 
are generated independently by a uniform distribution over [—1,1], A being 20 x 40. 
Since A are drawn by a continuous distribution, as noted in Lemma 4 of [18], problem 
1.1 must have a unique solution. Figure 1 shows the ISTA and FISTA convergence 
behavior. Using the notations from Theorems 5.9 & 5.10, the figures show the error 
of X, — x*|| (A: top curve) and the difference between two consecutive iterates 
of x: ||x[^l — x[^“^l|| (B: bottom curve) 

Figure I (left) shows the behavior of ISTA. ISTA takes 5351 iterations to converge 
and the flag matrix D changes 25 times in total. During the first 174 iterations, the 
iteration passes through a few transitional phases and the flag matrix D changes 20 
times. After that, from iteration 175 to 483, it stays in regime [B] with an invariant 
D. Then from iteration 484 to 530, from 531 to 756, from 767 to 4722 and from 4723 
to 4972, it passes through four different regimes [B]. Within each regime [B], the flag 
matrix D is invariant. According to our analysis in Section 4.2, there exists a Jordan 
chain in each of these regimes [B], indicating that we are indeed in a “constant step” 
regime. In other words, the difference between two consecutive iterates jjxl^l — x[^“^l|| 
quickly converges to Raug’s eigenvector for eigenvalue 1 in each of these regimes [B]. 
Taking iterations from 767 to 4972 for example, one could notice curve B in Figure 
1 (ISTA) that jjxl^'l — xf^“^l|| is a constant from iteration 767 to 4722. Finally, at 
iteration 4973, it reaches and stays in the final regime [A], converging linearly in 378 
steps. Indeed, the iterates are close enough to the final optimum so that the flags 
never change again. The linear convergence rate depends on the spectral radius of 
i?, i.e. upper left part of Raug, which is p[R) = 0.9817, separated from the Raug’s 
largest eigenvalue 1, consistent with Theorem 5.9. 

Figure l(right) shows the behavior of FISTA. FISTA takes 1017 iterations to 
converge and the flag matrix D changes 42 times in total. After flag matrix D changes 
42 times in initial 258 iterations, it reaches the final regime [A] at iteration 259 and 
converges linearly in 758 steps. Since Naug varies at each iteration due to varying r, 
the convergence rate changes very slightly step by step. The spectral radius of N, i.e. 
upper left part of the operator Naug in the last step is p{N) = 0.9914. Actually, the 
largest eigenvalues of N are a complex conjugate pair, 0.9843 ± 0.1185i. Note that 
they are complex numbers because of the increasing r, as stated in Proposition 6.5 in 
Section 6. Hence, in the final regime, the convergence to eigenvector for eigenvalue 1 
of Naug will oscillate in the invariant subspaces spanned by the eigenvectors of two 
conjugate complex pairs. This explains why curves in Figure 1 (FISTA) oscillates in 
the latter part of the FISTA convergence. 

Figure 2 shows the eigenvalues of the operators Raug and Naug during the final 
regime. One notices that the eigenvalues for the Raug from (3.5) lie strictly on the 
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Fig. 2. ISTA (left) and FISTA (right) on Example 1: Eigenvalues of ISTA operator Raug CLud 
FISTA operator Naug on the complex plane during the last regime of the iteration process. The 
unit circle and V 2 ) Q-tc shown for reference. 


interval (0,1) and eigenvalues for Naug He close to the boundary but strictly inside 
the circle ^/ 2 ) (except for 0 and 1), consistent with Lemmas 3.4 & 3.5. 

Comparing ISTA with FISTA, we make two remarks based on our propositions 
in Section 6: 

a. It costs FISTA many fewer steps (259 iterations) than ISTA (4973 iterations) 
to get to the hnal regime. The main reason is that FISTA has much larger constant 
steps in regime [B] so that it can jump out of that regime more quickly. Though 
this will lead to more changes of regimes (flag matrix D changes 42 times, 17 more 
times than ISTA), the overall iteration numbers have been cut down, consistent with 
Lemma 6.4 in Section 6. One can also notice this in Figure l(FISTA) that difference 
of iterates will not remain at a constant number for a long time and the iterates will 
be soon in the final regime. 

b. At iteration 259 of FISTA, r = 0.9886 while p{R) = 0.9817. Proposition 6.5 
predicts p{N) > p{R) for the rest of the iterations. Indeed, p{N) = 0.9914 > p{R). 
This means ISTA is faster than FISTA in each of their final regime. In other words, 
if one detects the arrival of final regime and then change from FISTA to ISTA at step 
259, the algorithm (denoted as F/ISTA in Figure 3) should have converge faster than 
the standard FISTA. As shown in Figure 3, the algorithm of this idea converges only 
in 696 iterations with the same accuracy compared to 1017 iterations of FISTA. 

Example 2 We consider an example of compressed sensing. The focus of this 
example is not on the efficiency comparison among different methods but to show its 
local behavior to support our analysis. Suppose there exists a true sparse signal repre¬ 
sented by a n-th dimension vector x with k non-zero elements. We observe the image 
of Xs under the linear transformation Ax^, where A is the so-called measurement 
matrix. Our observation thus should be 

(7.1) b = Axs-|-e 

where e is the observation noise. The goal is to recover the sparse vector x^ from 
the measurement matrix A and observation b. For this example, we let A € 
be Gaussian matrix whose elements are i.i.d distributed as A/'(0,1) with m = 128 
and n = 1024, e be a vector whose elements are i.i.d distributed as A/'(0, cr^) with 
a = 10“^. The original true signal for the problem is generated by choosing the 
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Fig. 3. Curve C and D show hybrid F/ISTA on Example 1, i.e. follow FISTA during initial 259 
iterations to reach the final regime and then switch to ISTA until it converges. The total iteration 
number is 696, far less than 1017 iterations of FISTA alone with the same accuracy. The FISTA 
on Example 1 (curve A and B) is shown for reference. 


locations of x’s k{= 10 ) nonzeros uniformly at random, then setting those locations 
to values drawn from jV(0 , 2^). 

We solve this compressd sensing problem by model (1.1) with both ISTA and 
FISTA method. For ISTA, we set A = 1 and the final recovered signal Xopt , i.e. 
the optimal solution of model (1.1) under A = 1, has the relative error ||Xopt — 
^s||/||xs|| = 5.34 X 10“^. As for FISTA, we set A = 0.5 and the final recovered signal 
Xopt, i.e. the optimal solution of model (1.1) under A = 0.5, has the relative error 
popt - x^ll/p^ll = 2.65 X 10“^. 

It costs ISTA 2822 iterations to reach the final regime and finally converges in 
totally 3001 iterations. On the other hand, it costs FISTA 372 iterations to reach the 
hnal regime and converges in totally 717 iterations. Figure 3 show their convergence 
behavior. It can be seen that curves of the difference of iterates in both two figures 
remain at a constant number for many iterations. This is because they are in the 
constant regimes such that the difference between consecutive iterates (curves B) are 
converging to a constant vector. But such iteration number for FISTA obviously is 
shorter than ISTA because it has a larger constant step size, as we indicated in Section 
6 . 

Finally, both algorithms has linear convergence for the hnal regimes. The linear 
convergence rate for ISTA is the second largest eigenvalue of Raug, which equal to 
0.9587. The linear convergence rate for FISTA from step 372 to the last step 717 

[All 

is the second largest eigenvalue of N^ug, which remains at 0.9752 for k from 372 to 
717, slower than ISTA rate. From this example, by the time FISTA reaches the hnal 
regime, r is so close to 1 that the rate for FISTA is changing very little. At iteration 
372, = 0.9920, which is already greater than 0.9587, predicting that switching 

to ISTA at this point would be advantageous. 

Particularly, if one detects the arrival of hnal regime and then changes from 
FISTA to ISTA, the algorithm (denoted as F/ISTA in Figure 4) will have a faster 
linear convergence rate. Much computational cost will be reduced. As shown in 
Figure 4, the algorithm of this idea converges only in 494 iterations with the same 
accuracy compared to 717 iterations of FISTA. 

8. Conclusion. In this paper, we show the locally linear convergence of ISTA 
and FISTA, applied to the LASSO problem. Extending the same techniques as in 
[3], both algorithms can be modeled as the matrix recurrence form and thus the 
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Fig. 4. Left: ISTA on Example 2. Right: Curve A and B is the FIST A on Example 2. Curve C 
and D is the local convergence behavior of F/ISTA on Example 2, i.e. first run FISTA during initial 
372 iterations to reach the final regime and then run ISTA until it converges. The total iteration 
number is 494, fo.r less than 717 iterations of FISTA with the same accuracy. 


spectrum can be used to analyze their convergence behaviors. It is shown that the 
method normally passes through several regimes of four types and eventually settles 
on a ‘linear regime’ in which the iterates converge linearly with the rate depending 
on the absolute value of the second largest eigenvalue of the matrix recurrence. 

In addition, we provide a way to analyze every type of the regime. Such analysis 
in terms of regimes allows one to study the aspect of acceleration of FISTA. It is well 
known that FISTA is faster than ISTA according the worst case complexity bound. 
Our analysis gives another way to show how both methods behave during the whole 
iterations. It turns out that FISTA is not always faster than ISTA in regime [A] and 
[C], depending on the continually growing stepsize. But in general FISTA is faster 
because of its acceleration in regime [B]. 
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